interface element
InSight-R: A Framework for Risk-informed Human Failure Event Identification and Interface-Induced Risk Assessment Driven by AutoGraph
Xiao, Xingyu, Tong, Jiejuan, Chen, Peng, Sun, Jun, Sui, Zhe, Liang, Jingang, Zhao, Hongru, Zhao, Jun, Wang, Haitao
Human reliability remains a critical concern in safety-critical domains such as nuclear power, where operational failures are often linked to human error. While conventional human reliability analysis (HRA) methods have been widely adopted, they rely heavily on expert judgment for identifying human failure events (HFEs) and assigning performance influencing factors (PIFs). This reliance introduces challenges related to reproducibility, subjectivity, and limited integration of interface-level data. In particular, current approaches lack the capacity to rigorously assess how human-machine interface design contributes to operator performance variability and error susceptibility. To address these limitations, this study proposes a framework for risk-informed human failure event identification and interface-induced risk assessment driven by AutoGraph (InSight-R). By linking empirical behavioral data to the interface-embedded knowledge graph (IE-KG) constructed by the automated graph-based execution framework (Auto-Graph), the InSight-R framework enables automated HFE identification based on both error-prone and time-deviated operational paths. Furthermore, we discuss the relationship between designer-user conflicts and human error. This framework offers actionable insights for interface design optimization and contributes to the advancement of mechanism-driven HRA methodologies.
Keywords: Knowledge-Graph-Driven, Automated, Interface-Induced Risk, Human Error Identification
1 Introduction
Human error remains a leading contributor to failures in complex socio-technical systems such as nuclear power plants, aviation, and healthcare, where safety-critical operations depend on accurate and timely human decisions [1, 2]. Human reliability analysis (HRA) methods have been widely used to model operator behavior and assess the likelihood of human failure events (HFEs) [3].
However, prevailing HRA approaches are often constrained by their reliance on expert judgment, particularly in the identification of HFEs and the assignment of performance influencing factors (PIFs) [3, 4]. In traditional HRA frameworks such as the integrated human event analysis system for event and condition assessment (IDHEAS-ECA), HFEs are primarily determined through expert elicitation, a process that, while practical, suffers from limited reproducibility, insufficient transparency, and weak theoretical grounding [5].
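As a rough illustration of what identifying HFEs from "error-prone and time-deviated operational paths" could look like, the sketch below flags paths whose observed error rate or mean completion time exceeds a threshold. It is not the InSight-R implementation: the `PathObservation` record, the function name, and both thresholds are hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class PathObservation:
    """Hypothetical record of behavioral data for one operational path."""
    path: Tuple[str, ...]      # interface elements visited, in order
    errors: int                # number of trials that contained an error
    trials: int                # total number of observed trials
    durations: List[float]     # observed completion times in seconds
    expected_duration: float   # nominal completion time in seconds


def flag_candidate_hfes(observations, max_error_rate=0.1, max_time_ratio=1.5):
    """Return (path, reasons) for paths that look error-prone or time-deviated.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    flagged = []
    for obs in observations:
        reasons = []
        if obs.trials and obs.errors / obs.trials > max_error_rate:
            reasons.append("error-prone")
        if obs.durations and mean(obs.durations) > max_time_ratio * obs.expected_duration:
            reasons.append("time-deviated")
        if reasons:
            flagged.append((obs.path, reasons))
    return flagged
```

Paths that are flagged here would still need analyst review; the sketch only narrows the candidates, it does not replace the risk assessment step.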
Understanding GUI Agent Localization Biases through Logit Sharpness
Tao, Xingjian, Wang, Yiwei, Cai, Yujun, Yang, Zhicheng, Tang, Jing
Multimodal large language models (MLLMs) have enabled GUI agents to interact with operating systems by grounding language into spatial actions. Despite their promising performance, these models frequently exhibit hallucinations, i.e., systematic localization errors that compromise reliability. We propose a fine-grained evaluation framework that categorizes model predictions into four distinct types, revealing nuanced failure modes beyond traditional accuracy metrics. To better quantify model uncertainty, we introduce the Peak Sharpness Score (PSS), a metric that evaluates the alignment between semantic continuity and logits distribution in coordinate prediction. Building on this insight, we further propose Context-Aware Cropping, a training-free technique that improves model performance by adaptively refining input context. Extensive experiments demonstrate that our framework and methods provide actionable insights and enhance the interpretability and robustness of GUI agent behavior.
Fatigue-Aware Adaptive Interfaces for Wearable Devices Using Deep Learning
Wearable devices, such as smartwatches and head-mounted displays, are increasingly used for prolonged tasks like remote learning and work, but sustained interaction often leads to user fatigue, reducing efficiency and engagement. This study proposes a fatigue-aware adaptive interface system for wearable devices that leverages deep learning to analyze physiological data (e.g., heart rate, eye movement) and dynamically adjust interface elements to mitigate cognitive load. The system employs multimodal learning to process physiological and contextual inputs and reinforcement learning to optimize interface features like text size, notification frequency, and visual contrast. Experimental results show an 18% reduction in cognitive load and a 22% improvement in user satisfaction compared to static interfaces, particularly for users engaged in prolonged tasks. This approach enhances accessibility and usability in wearable computing environments.
Large Language Model-Brained GUI Agents: A Survey
Zhang, Chaoyun, He, Shilin, Qian, Jiaxu, Li, Bowen, Li, Liqun, Qin, Si, Kang, Yu, Ma, Minghua, Liu, Guyue, Lin, Qingwei, Rajmohan, Saravan, Zhang, Dongmei, Zhang, Qi
GUIs have long been central to human-computer interaction, providing an intuitive and visually-driven way to access and interact with digital systems. The advent of LLMs, particularly multimodal models, has ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. This has paved the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span across web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that revolutionizes how individuals interact with software. This emerging field is rapidly advancing, with significant progress in both research and industry. To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address research questions such as existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents.
Sharingan: Extract User Action Sequence from Desktop Recordings
Chen, Yanting, Ren, Yi, Qin, Xiaoting, Zhang, Jue, Yuan, Kehong, Han, Lu, Lin, Qingwei, Zhang, Dongmei, Rajmohan, Saravan, Zhang, Qi
Video recordings of user activities, particularly desktop recordings, offer a rich source of data for understanding user behaviors and automating processes. However, despite advancements in Vision-Language Models (VLMs) and their increasing use in video analysis, extracting user actions from desktop recordings remains an underexplored area. This paper addresses this gap by proposing two novel VLM-based methods for user action extraction: the Direct Frame-Based Approach (DF), which inputs sampled frames directly into VLMs, and the Differential Frame-Based Approach (DiffF), which incorporates explicit frame differences detected via computer vision techniques. We evaluate these methods using a basic self-curated dataset and an advanced benchmark adapted from prior work. Our results show that the DF approach achieves an accuracy of 70% to 80% in identifying user actions, with the extracted action sequences being re-playable through Robotic Process Automation. We find that while VLMs show potential, incorporating explicit UI changes can degrade performance, making the DF approach more reliable. This work represents the first application of VLMs for extracting user action sequences from desktop recordings, contributing new methods, benchmarks, and insights for future research.
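To make "explicit frame differences" concrete, here is a minimal sketch of one way to locate the region that changed between two consecutive frames. It assumes grayscale frames stored as nested lists of 0-255 intensities; the function name and the threshold are hypothetical, and the abstract does not specify which computer vision technique DiffF actually uses.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # rows of grayscale pixel intensities (0-255)


def changed_region(prev: Frame, curr: Frame,
                   threshold: int = 30) -> Optional[Tuple[int, int, int, int]]:
    """Return the bounding box (top, left, bottom, right) of changed pixels.

    A pixel counts as changed when its intensity differs by more than
    `threshold` between the two frames. Returns None if nothing changed.
    """
    changed = [
        (r, c)
        for r, (row_a, row_b) in enumerate(zip(prev, curr))
        for c, (a, b) in enumerate(zip(row_a, row_b))
        if abs(a - b) > threshold
    ]
    if not changed:
        return None
    rows = [r for r, _ in changed]
    cols = [c for _, c in changed]
    return (min(rows), min(cols), max(rows), max(cols))
```

A bounding box like this could be passed to a VLM alongside the sampled frames, although the abstract reports that adding such explicit UI-change information can hurt accuracy.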
Mind-proofing Your Phone: Navigating the Digital Minefield with GreaseTerminator
Datta, Siddhartha, Kollnig, Konrad, Shadbolt, Nigel
Digital harms are widespread in the mobile ecosystem. As these devices gain ever more prominence in our daily lives, so too increases the potential for malicious attacks against individuals. The last line of defense against a range of digital harms - including digital distraction, political polarisation through hate speech, and children being exposed to damaging material - is the user interface. This work introduces GreaseTerminator to enable researchers to develop, deploy, and test interventions against these harms with end-users. We demonstrate the ease of intervention development and deployment, as well as the broad range of harms potentially covered with GreaseTerminator in five in-depth case studies.
Decomposed Inductive Procedure Learning
Weitekamp, Daniel, MacLellan, Christopher, Harpstead, Erik, Koedinger, Kenneth
Recent advances in machine learning have made it possible to train artificially intelligent agents that perform with super-human accuracy on a great diversity of complex tasks. However, the process of training these capabilities often necessitates millions of annotated examples, far more than humans typically need in order to achieve a passing level of mastery on similar tasks. Thus, while contemporary methods in machine learning can produce agents that exhibit super-human performance, their rate of learning per opportunity in many domains is decidedly lower than that of human learning. In this work we formalize a theory of Decomposed Inductive Procedure Learning (DIPL) that outlines how different forms of inductive symbolic learning can be used in combination to build agents that learn educationally relevant tasks, such as mathematical and scientific procedures, at a rate similar to human learners. We motivate the construction of this theory along Marr's concepts of the computational, algorithmic, and implementation levels of cognitive modeling, and outline at the computational level six learning capacities that must be achieved to accurately model human learning. We demonstrate that agents built along the DIPL theory are amenable to satisfying these capacities, and demonstrate, both empirically and theoretically, that DIPL enables the creation of agents that exhibit human-like learning performance.
Grounding Natural Language Instructions: Can Large Language Models Capture Spatial Information?
Rozanova, Julia, Ferreira, Deborah, Dubba, Krishna, Cheng, Weiwei, Zhang, Dell, Freitas, Andre
Models designed for intelligent process automation are required to be capable of grounding user interface elements. This task of interface element grounding is centred on linking instructions in natural language to their target referents. Even though BERT and similar pre-trained language models have excelled in several NLP tasks, their use has not been widely explored for the UI grounding domain. This work concentrates on testing and probing the grounding abilities of three different transformer-based models: BERT, RoBERTa and LayoutLM. Our primary focus is on these models' spatial reasoning skills, given their importance in this domain. We observe that LayoutLM has a promising advantage for applications in this domain, even though it was created for a different original purpose (representing scanned documents): the learned spatial features appear to be transferable to the UI grounding setting, especially as they demonstrate the ability to discriminate between target directions in natural language instructions.
Neural Networks Art: Solving Problems with Multiple Solutions and New Teaching Algorithm
The human brain continuously processes information flowing in from the external environment. It can nevertheless modify and update stored images, and create new ones, without destroying what was previously memorized. In this it differs significantly from the majority of neural networks (NNs): in networks trained by backpropagation or genetic algorithms, in bidirectional associative memory, in Hopfield networks, etc., a new training pattern, situation, or association very often significantly distorts or even destroys the fruits of prior learning, requiring a change in a significant part of the connection weights or complete retraining of the network [1-4]. The inability of these NNs to solve the stability-plasticity problem, that is, the problem of perceiving and memorizing new information without loss or distortion of existing information, was one of the main reasons for the development of fundamentally new neural network configurations. Examples of such networks are those derived from adaptive resonance theory (ART), developed by Carpenter and Grossberg [5, 6].
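The stability-plasticity idea can be illustrated with a heavily simplified, ART1-style clustering sketch for binary patterns. It is not the algorithm proposed in this paper, and it omits parts of the original ART1 (for example, the choice function is replaced by plain overlap ranking). The function name and the vigilance value are assumptions. A new pattern either resonates with an existing prototype, which is then refined, or creates a new category, so earlier categories are never overwritten by unrelated inputs.

```python
def art1_like_cluster(patterns, vigilance=0.6):
    """Cluster binary patterns with a simplified ART1-style vigilance test.

    Returns (assignments, prototypes), where each prototype is the set of
    active bit indices shared by the patterns assigned to that category.
    """
    prototypes = []   # one frozenset of active indices per category
    assignments = []  # category index chosen for each input pattern
    for pattern in patterns:
        active = frozenset(i for i, bit in enumerate(pattern) if bit)
        # Try categories in order of decreasing overlap with the input.
        candidates = sorted(range(len(prototypes)),
                            key=lambda j: len(prototypes[j] & active),
                            reverse=True)
        chosen = None
        for j in candidates:
            overlap = prototypes[j] & active
            # Vigilance test: enough of the input must match the prototype.
            if active and len(overlap) / len(active) >= vigilance:
                prototypes[j] = overlap  # refine the matching category only
                chosen = j
                break
        if chosen is None:
            # No resonance: store the input as a new category (plasticity)
            # and leave every existing prototype untouched (stability).
            prototypes.append(active)
            chosen = len(prototypes) - 1
        assignments.append(chosen)
    return assignments, prototypes
```

Raising `vigilance` toward 1.0 produces more, narrower categories; lowering it merges more inputs into existing ones.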
Scout: Rapid Exploration of Interface Layout Alternatives through High-Level Design Constraints
Swearngin, Amanda, Wang, Chenglong, Oleson, Alannah, Fogarty, James, Ko, Amy J.
Although exploring alternatives is fundamental to creating better interface designs, current processes for creating alternatives are generally manual, limiting the alternatives a designer can explore. We present Scout, a system that helps designers rapidly explore alternatives through mixed-initiative interaction with high-level constraints and design feedback. Prior constraint-based layout systems use low-level spatial constraints and generally produce a single design. To support designer exploration of alternatives, Scout introduces high-level constraints based on design concepts (e.g., semantic structure, emphasis, order) and formalizes them into low-level spatial constraints that a solver uses to generate potential layouts. In an evaluation with 18 interface designers, we found that Scout: (1) helps designers create more spatially diverse layouts with similar quality to those created with a baseline tool and (2) can help designers avoid a linear design process and quickly ideate layouts they do not believe they would have thought of on their own.
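The step from high-level design constraints to low-level spatial ones can be sketched with a toy example. The code below is not Scout's solver: it brute-forces vertical orderings instead of using a constraint solver, and the function name, parameters, and pixel sizes are all hypothetical. An "order" constraint becomes a requirement on relative positions, and "emphasis" becomes a larger height.

```python
from itertools import permutations


def generate_layouts(elements, order_constraints, emphasized,
                     row_height=40, emphasis_scale=2):
    """Enumerate vertical layouts that satisfy the high-level constraints.

    order_constraints: (a, b) pairs meaning element `a` must appear above `b`.
    emphasized: names of elements that should be drawn taller.
    Returns a list of layouts; each layout is a list of (name, y, height).
    """
    layouts = []
    for ordering in permutations(elements):
        position = {name: i for i, name in enumerate(ordering)}
        # High-level "order" lowered to a check on vertical positions.
        if any(position[a] > position[b] for a, b in order_constraints):
            continue
        y = 0
        layout = []
        for name in ordering:
            # High-level "emphasis" lowered to a concrete height in pixels.
            height = row_height * (emphasis_scale if name in emphasized else 1)
            layout.append((name, y, height))
            y += height
        layouts.append(layout)
    return layouts
```

For three elements with one order constraint, this returns every ordering that keeps the constrained pair in sequence, which mirrors the idea of presenting several alternatives rather than a single solved design.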